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Abstract 
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Many problems in geophysical and atmospheric modelling require the fast solution of elliptic partial differential equations (PDEs) 
in "flat" three dimensional geometries. In particular, an anisotropic elliptic PDE for the pressure correction has to be solved at every 
time step in the dynamical core of many numerical weather prediction (NWP) models, and equations of a very similar structure arise 
I in global ocean models, subsurface flow simulations and gas and oil reservoir modelling. The elliptic solve is often the bottleneck of 
the forecast, and to meet operational requirements an algorithmically optimal method has to be used and implemented efficiently. 
(X) Graphics Processing Units (GPUs) have been shown to be highly efficient (both in terms of absolute performance and power 
consumption) for a wide range of applications in scientific computing, and recently iterative solvers have been parallelised on these 
architectures. In this article we describe the GPU implementation and optimisation of a Preconditioned Conjugate Gradient (PCG) 
algorithm for the solution of a three dimensional anisotropic elliptic PDE for the pressure correction in NWP. Our implementation 
exploits the strong vertical anisotropy of the elliptic operator in the construction of a suitable preconditioner. As the algorithm is 
memory bound, performance can be improved significantly by reducing the amount of global memory access. We achieve this by 
using a matrix-free implementation which does not require explicit storage of the matrix and instead recalculates the local stencil. 
Global memory access can also be reduced by rewriting the PCG algorithm using loop fusion and we show that this further reduces 
the runtime on the GPU. We demonstrate the performance of our matrix-free GPU code by comparing it both to a sequential CPU 
fl") implementation and to a matrix-explicit GPU code which uses existing CUDA libraries. The absolute performance of the algorithm 
C^~) for different problem sizes is quantified in terms of floating point throughput and global memory bandwidth. 



> 1 Introduction coupling by several orders of magnitude. To achieve optimal 

performance, it is important to exploit the strong anisotropy 

Anisotropic elliptic PDEs arise in many areas of geophysical of the system when choosing an appropriate computational 

and atmospheric modelling, which are often characterised grid and an efficient solver, 
by "flat" geometries: the horizontal extent of the domain of 

interest is much larger than its vertical size. This is the case In this work we focus on the elliptic PDE for the pres- 

for global weather- and climate prediction models. As the sure correction arising in the dynamical core of numerical 

height of the atmosphere is significantly smaller than the ra- weather- and climate- prediction models. In many forecast 

dius of the earth, the horizontal resolution is of the order of models semi-implicit scmi-Lagrangian time stepping intro- 

10-25 kilometers but the vertical grid spacing is several tens duced in [T] and [2] is used to advance the atmospheric fields 

or hundreds of metres. Similar ranges of scales are encoun- forward in time. In contrast to explicit time stepping this 

tered in models for simulating global ocean currents. Due method has a larger stability region and allows for longer 

to the layered structure of geological formations, oil and gas model time steps without compromising the accuracy of the 

reservoir simulations and subsurface flow models of aquifers large scale dynamics, which can reduce the overall model 

are also typically carried out in "flat" domains. After dis- runtime. However, if this approach is used to solve the fully 

cretisation the cells of the computational grid are very flat compressible non- hydrostatic Euler equations, a three di- 

and the resulting matrix stencil is highly anisotropic, i.e. mensional PDE for the pressure correction has to be solved 

the coupling in the vertical direction exceeds the horizontal at every time step as discussed for example in [31 HI 03 GO E] , 
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which often forms the computationally most expensive pro- 
portion of the model runtime. 

Algorithmically the most efficient solvers for large ellip- 
tic PDEs are suitably preconditioned Krylov subspace- or 
multigrid methods (see e.g. [3 EH OH E2])- The strong 
anisotropy in the vertical direction can be exploited to con- 
struct an efficient preconditioner (or multigrid smoother) 
based on vertical line relaxation as discussed in j3] |5]. In 
an related context [13] and [14] describe how the equa- 
tions of ocean flows can be discretised on a tensor prod- 
uct grid which is unstructured in the horizontal but con- 
sists of regular columns in the vertical direction. Again the 
strong anisotropy is used in the construction of an efficient 
preconditioner of the iterative solver. In a similar fashion 
anisotropic elliptic PDEs arise in fully implicit methods for 
gas- and oil reservoir modelling. A "supercoarsening" multi- 
grid algorithm for solving elliptic PDEs encountered in mul- 
tiphase flow in porous media is described by [T5] : while the 
full three dimensional equation is solved on the finest grid, 
any vertical variations are averaged out on the coarser multi- 
grid levels by collapsing vertical columns to a single layer. 

The exact hardware on which forecast models will be im- 
plemented in the future is currently unknown, and it is 
important to explore novel chip architectures in addition 
to standard CPUs. Graphics Processing Units (GPUs) are 
fast and power- efficient computing devices and significant 
speedups relative to standard CPU implementations have 
been achieved in the past for iterative solvers for elliptic 
PDE, as described in (TH [TT1 HH1 UHl HH H21 EH HI] - 

While modern multicore CPUs contain several tens of 
cores and have a peak floating point performance of 
O(10 - 100) GFLOPs, GPUs have several hundreds to thou- 
sands of cores and applications have to make efficient use 
of the massively parallel SIMD architecture and limited 
cache size per thread. The nVidia M2090 Fermi GPU, on 
which this work was carried out, has a peak performance of 
1.331/0.665 TFLOP/s in single and double precision respec- 
tively and a global memory bandwidth of 177GByte/s. By 
dividing the peak FLOP rate by the memory bandwidth on 
the GPU, one can deduce that the number of computations 
per floating point variable read from memory is around 30, 
so computations are essentially "free" and the performance 
is limited by the speed with which data can be read from 
global memory and how efficiently it can be kept in cache. 

In this article we describe a matrix-free GPU implementa- 
tion of a preconditioned Conjugate Gradient (PCG) solver 
tailored towards the solution of anisotropic PDEs with a 
tensor-product structure. The most compute intensive com- 
ponents of the iterative solver are the evaluation of a large 
sparse matrix-vector product (SpMV) and the inversion of 
a block-tridiagonal matrix. Both kernels were ported to the 
GPU and the memory access pattern and thread layout were 
adapted to increase data throughput. In our implementa- 
tion the matrix is not stored explicitly but recalculated in 
every grid cell, which reduces access to global memory com- 
pared to the matrix-explicit code. However the stencil we 



use is more complicated than the simple Poisson stencil on 
a regular grid used in previous studies (see [171 H^] V Due to 
the tensor product structure of the equation and the com- 
putational grid, the matrix can be written as the product of 
a one dimensional vertical discretisation, which is the same 
for every column and only needs to be calculated and stored 
once, and a horizontal stencil. As the number of vertical 
levels is very large in atmospheric applications, and the hor- 
izontal coupling only needs to be calculated once for each 
vertical column, this creates only a small overhead. The 
vertical discretisation requires the storage of four vectors of 
length n z , where n z is the number of vertical levels. In total 
we store 4 x n z values to parametrise the matrix. With the 
exception of [T?l HU] , all implementations discussed in the 
literature that we are aware of, store the matrix explicitly, 
which requires the storage of 7 x rihoriz X n z matrix entries, 
where nhoriz is the number of horizontal grid cells. This 
can have a negative impact on the performance on band- 
width limited architectures as it requires reading the matrix 
stencil from global memory in addition to the field vectors. 
While the specific implementation described in this article 
is based on a three dimensional grid which can be written as 
the tensor product of regular horizontal grid and a graded 
vertical grid, the method we present can also be applied to 
unstructured horizontal grids. 

As the sparse matrix-vector product and preconditioner 
solve are highly efficient in our GPU implementation, other 
parts of the main CG such as level 1 BLAS vector updates 
and scalar products start to account for a significant pro- 
portion of the runtime, even if they are implemented using 
optimised GPU libraries such as CUBLAS. We find that 
to achieve further performance increases it is not sufficient 
to optimise the kernels in isolation, but rather several 
components of the main CG iteration need to be considered 
together, as has been suggested in [20 . In particular we 
find that the number of memory references can be reduced 
further by fusing the loops over the computational grid in 
the main kernels with the BLAS operations and this can 
lead to an additional performance gain of around 30%. For 
the entire PCG algorithm we are able to obtain a total 
speedup of a factor 60 x for single precision arithmetic 
on a nVidia Fermi M2090 card relative to one core of 
an Intel Xeon Sandybridge E5-2620 CPU. For double 
precision arithmetic the speedup was slightly smaller with 
48 x. This includes time for setting up the discretisation 
and copying data between host and device. To study the 
performance of our matrix-free GPU code we compared it 
to an implementation which stores the matrix explicitly in 
the compressed sparse row storage (CSR) format using the 
CUSPARSE and CUBLAS libraries. Our matrix-free code is 
significantly faster than the implementation based on CRS 
data structures which does not exploit the regular structure 
of the problem. We quantified the absolute performance in 
terms of floating point operations per second (FLOPs) and 
global memory bandwidth for different problem sizes, where 
the latter is the more relevant measure for the performance 
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of a memory bound algorithm. The optimised matrix-free 
code achieves a bandwidth of around 25% — 50% of the 
theoretical peak value, which is a sizeable proportion but 
shows that theoretically there is still potential for additional 
improvements which could lead to a further speedup of a 
factor 2 x — 4x. An idea of how this could be achieved 
by increasing the granularity of the algorithm is discussed 
below. The floating point performance is 70-80 GFLOPs 
for single precision and 40-50 GFLOPs for double precision, 
corresponding to around 5—8% of the theoretical peak value. 



Overview. This article is organised as follows: previous 
GPU implementations of iterative solvers for PDEs are re- 
viewed in section [2] In section [3] the model equation and its 
discretisation is described in detail with particular emphasis 
on the tensor-product structure of the grid and the elliptic 
operator. Preconditioned Krylov-subspace solvers and the 
matrix-free and interleaved form of the PCG algorithm for 
solving the model equation are presented in section [4] A 
general overview over the GPU architecture and the CUDA 
programming- and execution model can be found in sec- 
tion [5] and our CUDA implementation of the PCG solver 
is described in section [6] Performance measurements are 
discussed in section [7] where we also present comparisons to 
a matrix-explicit implementation and quantify the absolute 
performance. Our conclusions and a discussion of planned 
further work can be found in section [5J For reference ap- 
pendix [X] contains the explicit form of the two most impor- 
tant kernels of our optimised algorithm. 

2 Previous work 

The GPU implementation of Krylov-subspace solvers and in 
particular of the Preconditioned Conjugate Gradient algo- 
rithm has been studied extensively in the literature, both 
for more general sparse matrices and for matrices arising 
from the discretisation of elliptic PDEs. As far as we are 
aware, all implementations discussed in the literature (with 
the exception of |19j and |17| ) are based on matrix-explicit 
representations. While some of the authors study Poisson- 
or sign-positive Helmholtz equations, none of the problems 
studied in the literature show the strong anisotropy which 
characterises the elliptic operator we consider in this work, 
and hence the preconditioners investigated in the literature 
will not be optimal in our case. 

While the speedups presented in the following review de- 
pend on the problem and on the hardware used at a par- 
ticular time and should only be used as an indicator for 
achievable performance gains, almost all GPU implementa- 
tions are significantly faster than the corresponding CPU 
versions with speedups of 20 x — 40 x relative to the sequen- 
tial code. 

Some early work is presented in |16| where both a con- 
jugate gradient and a multigrid solver are implemented for 



solving the sign positive Helmholtz equation — Au + au = g 
arising from an implicit time discretisation of the incom- 
pressible Navier-Stokes equations on a regular two dimen- 
sional grid. However, for both solvers the matrix is stored 
explicitly. 

The compressed sparse row storage format (CSR) is a very 
popular and general format which has been used in a vari- 
ety of recent GPU implementations of Krylov subspace algo- 
rithms. A PCG solver for the same sign-positive Helmholtz 
equation as in [TB] is described in [23 for two and three 
dimensions. For the approximate inverse SSOR precondi- 
tioner, which requires an additional matrix multiplication, 
a socket-to-socket speedup of more than 8 x is reported for 
the best implementation of the PCG algorithm on an nVidia 
Tesla T10 card relative to the unpreconditioned CG algo- 
rithm on an Intel Xeon Quad-Core 2.66 GHz CPU. Although 
in contrast to [16 j a three dimensional system is solved, the 
elliptic operator considered is fully isotropic in both cases. 
In [2U] a modified version of the Conjugate Gradient algo- 
rithm is used for solving a set of general matrices from the 
University of Florida sparse matrix collection described in 
|26| . A simple Jacobi preconditioncr is used and the per- 
formance of the solver is optimised by using the prefetch 
CSR sparse matrix-vector multiplication in [37] and fusing 
kernels in the main PCG loop. As described in section |4.3| 
below we use a similar technique for fusing different ker- 
nels in our implementation to improve the performance of 
the code. Together with some other improvements the au- 
thors of [2U] report that this led to a significant speedup 
compared to a GPU implementation using the "Row per 
warp" sparse matrix- vector multiplication described in |28| . 
One of the problems studied in [2U] is the "thermal2" matrix 
which arises from an FEM discretisation of the stationary 
heat equation d x (kd x T) + d y (kd y T) = on an unstructured 
two dimensional grid. For this problem a speedup of 41 x 
relative to a single core of a Intel Core2 2.4GHz could be 
achieved on both nVidia GT8800 and GTX280 CPUs. The 
GPU implementation of a Krylov subspace solver for the 
(sign-indefinite) two dimensional Helmholtz equation is de- 
scribed in [21J. A shifted Laplace multigrid preconditioner 
is used to reduce the number of iterations and a speedup of 
around 30 x could be achieved on an nVidia GeForce 9800 
GTX/9800 GTX+ GPU relative to the sequential imple- 
mentation on one core of an AMD Phcnom 9850 CPU. 

Other sparse matrix storage formats have also been used 
to implement iterative solvers on CPUs. An implementation 
of a CG solver for the two dimensional PDE arising from the 
implicit time discretisation of the heat equation is described 
in The ELLPACK-R data format, which is more suit- 
able for structured problems, was used for storing the matrix 
and a speedup of a factor 26 x could be achieved for a two di- 
mensional problem of size 2048 x 2048 on an nVidia GeForce 
GTX 480 card, relative to the sequential implementation 
on an Intel Core i7 860 CPU with 2.80GHz. Although the 
structure of the five point nearest-neighbour stencil arising 
from a finite-difference approximation of the Poisson equa- 
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tion is similar to the stencil we use in our discretisation, 
in contrast to our problem the elliptic PDE solved in [25] 
is two dimensional and fully isotropic. The GPU imple- 
mentation of preconditioned GMRES and Conjugate Gra- 
dient solvers for a range of problems and preconditioners 
has been studied in [25] and the performance for different 
sparse matrix storage formats is compared. While in some of 
the implementations only the sparse-matrix vector product 
is carried out on the device and the preconditioner is exe- 
cuted on the host, preconditioners that are easier to paral- 
lelise are also ported to the GPU. However, the authors find 
that for a simple block-Jacobi preconditioner implemented 
on the GPU the number of iterations is very large. This 
should be compared to our implementation: for the strongly 
anisotropic elliptic PDE we consider the blocks have a direct 
physical interpretation as they described the strong vertical 
coupling within one column, which is much larger than the 
coupling between different blocks. As a result, the simple 
block-diagonal preconditioner proved to be very efficient in 
our numerical tests. In [22] the GPU implementation of a 
Preconditioned Conjugate Gradient solver for both a two di- 
mensional Poisson problem and the elliptic equation arising 
in the Variational Boussinesq Model (VBM) is described. 
A Repeated Red Black (RRB) preconditioner is used, but 
an incomplete Poisson preconditioner with diagonal scaling 
is also considered. As for the sparse approximate inverse 
used in [53], the incomplete Poisson preconditioner can be 
reduced to an additional sparse matrix product. The sparse 
matrix-vector product is implemented by storing the local 
five-point stencil at each gridpoint. For the RRB precondi- 
tioner a speedup of around 40 x could be achieved for the 
Poisson test problem on an nVidia GeForce GTX 580 card, 
relative to the sequential implementation on an Intel Xeon 
W3520 CPU. However, the costs for memory allocation and 
setup of the preconditioner matrix take up around a third 
of the total runtime. In contrast in our matrix free im- 
plementation only a small amount of data has to be copied 
between host and device and the matrix setup costs are neg- 
ligible. For realistic problems the speedup reported in [22] 
is 20 x — 30 x for the RRB preconditioner and 5 x — 20 x for 
the incomplete Poisson preconditioner. 

As far as we are aware, the only matrix-free implementa- 
tion of a CG solver discussed in the literature are [T71 [T5] . In 
[T7] both the homogenous Poisson equation and the Navier 
Stokes equation are solved on an unstructured mesh by im- 
plementing matrix-free gradient and divergence operations. 
On an nVidia a speedup of around 3x was achieved on an 
nVidia 6600GT card relative to an AMD Athlon 64 CPU. 
Note, however, that the implementation is based on the low- 
level graphics API and the hardware used in the study is 
quite dated by current standards. The GPU implementa- 
tion of a matrix-free PCG solver for the homogenous Poisson 
equation in three dimensions is also described in |19| . 

Due to its significance in many scientific applications and 
in particular iterative solvers, the performance of sparse 
matrix-vector multiplications on its own has been studied 



extensively in the literature: Various sparse matrix formats 
are described in |28j and their performance for a sparse 
matrix-vector multiplication is compared for both struc- 
tured and unstructured matrices. While CSR is the most 
general format and can be applied to matrices with widely 
varying row sizes, the best performance for structured ma- 
trices arising from the discretisation of PDEs is obtained 
with the DIA and ELLPACK formats. However, in [21] 
an efficient parameter dependent implementation of sparse 
matrix-vector multiplication based on the CSR format on 
cache based GPUs is described. Cache usage and perfor- 
mance can be improved significantly by varying the number 
of threads processing each row, the thread block size and 
number of rows processed by one cooperating thread group. 
The authors find that by tuning the parameters heuristi- 
cally, the performance of a Conjugate Gradient solver for 
a structured problem arising from the finite element dis- 
cretisation of a simple elliptic Poisson problem using CSR 
storage is comparable to the corresponding implementation 
using the ELLPACK format. Both matrix-explicit versions 
are beaten by an implementation which stores a small local 
matrix on each element and assembles the global stiffness 
matrix on-the-fly in each matrix-vector product as described 
in |30l 131] . For other work on improved matrix-explicit im- 
plementations of the sparse matrix-vector product see the 
review and references cited in |24| . 

The number of iterations can often be reduced signifi- 
cantly by using multigrid methods, and recently both geo- 
metric and algebraic multigrid solvers have been ported to 
GPUs, see e.g. [321 E3 EH- The extension of our PCG 
solver to a geometric multigrid solver for anisotropic prob- 
lems based on the tensor product idea in [33] is discussed in 
|36| and we are currently working on a GPU implementation 
of the same matrix- free geometric multigrid solver. 

We finally remark that multiple-CPU implementations of 
iterative solvers have been described in the literature (see 
e.g. [37] [35] EE |3S]) and we are currently working on ex- 
tending our algorithm to clusters of GPU. 

3 Model equation 

Following [7] [S] a model equation which reproduces the most 
important features of the full PDE for the pressure correc- 
tion in a NWP model, has been derived in [36]. The deriva- 
tion of the full pressure correction equation in atmospheric 
models can be found for example in [3J |3] [3] [B] and is de- 
scribed for ocean models in [T3] and [TJ. The model equa- 
tion we use is a symmetric positive definite PDE and can be 
written in spherical coordinates as 

-^(aw + A"* »(,»»)). + .-/ (d 

where A 2( j denotes the two dimensional Laplacian on the 
unit sphere. For simplicity all length scales arc measured 
in units of the earth radius i? C arth, and the equation is 
solved in a thin spherical shell re [1, 1 + ff atmos ]. We write 
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H a t m os — D / Rcarth *C 1 where D is the thickness of the at- 
mosphere. The model parameters w 2 and A 2 depend on the 
model time step size, in particular, uj is proportional to the 
time step size. In our numerical experiments the parameters 
were adjusted to their values for typical model resolutions in 
NWP with a Courant number of around 10. An important 
feature of the discretised PDE is the strong anisotropy in 
the vertical direction: the depth of the atmosphere is about 
two orders of magnitude smaller than the radius of the earth 
and consequently the horizontal grid spacing Ax is signifi- 
cantly larger than the vertical grid spacing Az. The strong 
grid aligned anisotropy in the vertical direction is given (ap- 
proximately) by 



7 2 = 



A Aa 



(2) 



and as H atmos C 1 we have Az -C Ax, so 7 2 1. This 
property can be used to construct a simple but very effi- 
cient and parallelisable preconditioner for iterative solvers 
of this equation, which solves the vertical equation exactly 
but ignores the horizontal couplings. 

The condition number of the preconditioned operator ap- 
proaches a fixed value as the horizontal resolution increases 
and does not diverge as for the Poisson equation. To 
see this, note that to keep the Courant number constant, 
the time step size At oc u) has to chosen to be propor- 
tional to the horizontal grid spacing, i.e. At oc Ax. The 
largest eigenvalue of the preconditioned matrix is of the or- 
der ui 2 /Ax 2 oc At 2 / Ax 2 and independent of Ax, whereas 
the smallest eigenvalue is 1 due to the presence of the zero 
order term in (JTJ. 

After discretisation the elliptic operator in (JT|) can be writ- 
ten abstractly in tensor product form as the sum of three 
terms: 

L = D {h) <g) M {r) + M w <g) D {r) + M {h) ® M (r) (3) 

Here and (M^ and M^) are horizontal (verti- 

cal) mass matrices which only contain couplings in the hor- 
izontal (vertical) direction. (D^) are second order 
derivate operators which contain couplings in the horizontal 
(vertical) direction. 

3.1 Discretisation 

To discretise ([TJ we use a finite volume scheme on a tensor- 
product grid, which consists of a (possibly unstructured but 
conforming) two dimensional grid on the unit sphere and a 
non-uniform (typically graded) one dimensional grid in the 
vertical direction. The model fields are defined as integrals 
in a grid cell given by the horizontal grid element T and 
vertical index k = 0, . . . , n z — 1 



7 (T) 



rk+l 



u(r, 0) r 2 dr d6 



(4) 



with 6 denoting horizontal coordinates on the unit sphere 
(throughout this work we use zero-based indexing as all our 



implementations are in the C programming language). In- 
dependent of the horizontal discretisation we need to store a 
vector u^ T ' of length n z at each element T. The vertical grid 
is defined by the grid points where k = 0, . . . , n z . The 
number of vertical grid cells is usually large, n z — 0(100). 
In meteorological applications a graded vertical grid with 
smaller grid spacings near the ground is desirable and we 
use r k = 1 + (k/n z ) 2 ■ H atmos . 

Equation ([lj is discretised using a finite volume scheme, 
and schematically it can be written for each horizontal grid 
element T as 



(Au) 



(T) 



Atu 



(T) 



E 

T'£j\f(T) 



At.t'U 



(T') 



/ 



(T) 



(3) 



where Af(T) is the set of neighbours of T. At is a tridiag- 
onal n z x n z matrix and At.t' are diagonal n z x n z matri- 
ces for each neighbouring element T". The matrices At.t' 
correspond to the off-diagonal couplings in the horizontal 
derivative operator D^ h ' <8> in fe| and are given by the 
product 



At,t' = cht,t' diag(rf) = oltj" 



/do 



V 



4 2 -i/ 



(6) 

The coefficient cut.t' is the ratio between the length St,t' 
of the edge between the cells T and T' and the distance be- 
tween the midpoints rj- and vt> of these cells. We also define 
the diagonal coefficient cut as the sum of the off-diagonal co- 
efficients 

GLT = Otx,T'- (7) 

The (symmetric) tridiagonal matrix At can be split into a 
sum of three terms: 



A/, 



"0 



b„ 

dr, 



(8) 



= \T\ diag(a) — c<t diag(d) 

+ \T\ tridiag (-(6 + c),6, c) , 

with dk = \T\(ak — bk~ Ck) — aijdk where |T| is the area of 
the horizontal grid element T . The first term corresponds 
to the product JlfW ® in The second term is the 
diagonal contribution of <g) M^) ; anc l the third term cor- 
responds to the vertical derivative term <£> D^ r \ While 
the finite volume discretisation described so far leads to a 
seven-point nearest-neighbour stencil on a regular grid, the 
same structure also arises for other stencil types which in- 
clude couplings between grid cells which are not directly 
adjacent. 

The vectors a, b, c and d do not depend on the grid cell T 
and can be precomputed. On the other hand the quantities 
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Figure 1: Structure of the matrix arising from the finite 
volume discretisation of for n z — 4 vertical columns. 
Vertical couplings are shown in red and horizontal couplings 
in blue. 



Figure 2: Cubed sphere grid. The model equation was dis- 
cretised on one logically rectangular panel of this grid. 



\T\ and olt,T' only need to be calculated once per vertical 
column. These two important observations will be exploited 
in the efficient matrix-free implementation of the PCG solver 
described below. 

3.2 Global matrix representation 

Assuming an ordering of the horizontal degrees of free- 
dom, which maps each cell T to a linear index v{T) G 
0, . . . , nhoriz — 1) one can write the full 3d solution vector 
of length n = rihoriz x n z as 



^{m'^M,...,^™)) 



with 



_(T) 
u i = u k 



where £ = n z ■ v(T) + k. 



(9) 



(10) 



In this representation the discretised equation ([5| can be 
written in the familiar form as 

Au = f (11) 

and the structure of the matrix A is shown in Fig. [T] Note 
that the matrix has a block structure, where each of the 
blocks corresponds to one combination (T,T r ) of neighbour- 
ing elements of the horizontal grid. Each of the gray blocks 
is of size n z xn z , the dark gray blocks (T' = T) describe the 
diagonal terms and vertical coupling, whereas the light gray 
blocks (T 7^ T' € JV(T)) describe the horizontal coupling. 
In the following we work on one panel of a cubed sphere grid 
with gnomonic projection (Fig. [2]). This defines a mapping 
from = [—1,1] x [—1,1] to spherical coordinates on one- 
sixth of the entire sphere and each cell of the regular grid of 
size nhoriz = m x m on f! can be labelled with an index pair 
(i, j) £ [0, m — 1] x [0, m — 1]. In this case we have 



where I = n z (m ■ i + j) + k 



(12) 



and label each grid cell Ty by its horizontal indices. We 
stress, however, that there is no algorithmic problem in ex- 
tending our approach to unstructured horizontal grids or to 
the entire spherical shell for example by ordering the hori- 
zontal grid cells along a space-filling curve. 



4 Iterative solvers for elliptic PDEs 

Typically the number of degrees of freedom per atmospheric 
variable in current global forecast models is at the order 
of several 10 millions. For next generation forecast models 
global horizontal model resolutions of around one kilometre 
are envisaged, which will require the solution of PDEs with 
more than 10 10 unknowns. Clearly naive direct methods 
can not be used for the solution of equations of this size 
and spectral methods often require a regular underlying grid 
structure and are hard to parallelise. 

Krylov subspace methods (see e.g. [T2]) are very efficient 
iterative algorithms for solving sparse linear systems, in par- 
ticular if they are suitably preconditioned. These methods 
construct the approximation of the exact solution u of 



(111 in a fc-dimensional Krylov-subspace 



span jr, Ar, A 2 r, . . . , A k Vj C 



(13) 



where r is the initial residual r = b — Au^°K They are easy 
to parallelise as in addition to local operations such as sparse 
matrix-vector multiplications and axpy-like vector updates 
they require only a small number of global reductions. 

4.1 Preconditioned Conjugate Gradient al- 
gorithm 

The simplest Krylov subspace method, which can be applied 
for symmetric positive definite matrices, is the Conjugate 
Gradient (CG) algorithm introduced in At each step 
the approximate solution vector is updated by adding 
a correction proportional to the search direction p( k \ The 
search directions are chosen such that they are A- orthog- 
onal for different k, k' : (p( k \ Ap( k ">) = for k ^ k' . The 
convergence rate depends on the spectral properties of the 
matrix A, in particular on the condition number k, which is 
the ratio between the largest and smallest eigenvalue, as de- 
rived in |12j . For the system arising from the discretisation 
of the Poisson equation, k grows rapidly with the inverse 



G 



grid spacing 1/h and it can be shown that asymptotically 
for h — > the number of iterations required to reduce the 
error by a factor e is 



k oc 



loge 



(14) 



For anisotropic systems h is the smallest grid spacing in the 
problem, for the highly anisotropic problem in this work this 
is the vertical grid spacing Az <C Ax. The dominant cost 
in each iteration is the sparse matrix-vector multiplication 
y <-h Ax, which is of 0(n) computational complexity. Hence 
the total cost of the algorithm is 



operation 


FLOPs MEM 


seal 


1 2 


axpy 




dot 


2 2 


nrm2 


2 1 


total (PCG) 


13 16 



Table 1: Number of floating point operations and memory 
references per grid cell for different level 1 BLAS operations. 
The last row shows the total number of FLOPs and memory 
references for all BLAS operations in the PCG algorithm. 



Cost(CG) oc ^loge. 



(15) 



To solve non-symmetric systems, more general Krylov space 
methods such as GMRES, BCG, BiCGStab and GCR can 
be used. Although the number of sparse matrix-vector 
products and intermediate vectors which need to be stored 
changes between different Krylov subspace algorithms, their 
general structure is very similar and in this work we focus 
on the Conjugate Gradient algorithm for simplicity. 



An equivalent version of the linear system in (111 can be 
obtained by multiplication with the inverse of the matrix 
M, 

M 1 Au = M l f. (16) 

This is generally referred to as left preconditioning. M is a 
matrix which should be easy to invert (i.e. it should be easy 
to solve the system M x = y) and as similar to A as possible, 
such that the preconditioned matrix Af _1 A is well condi- 
tioned. Usually these two requirements are mutually exclu- 
sive and a tradeoff between them has to be found. Initial 
profiling of the CPU code revealed that the sparse matrix- 
vector multiplication and preconditioner solve account for 
the largest proportion of the runtime (80 — 90%) . In addi- 
tion to these operations each iteration requires three axpy- 
like vector updates, one seal-operation and two scalar prod- 
ucts (dot). An additional norm calculation (nrm) is required 
for the evaluation of the stopping criterion. The number of 
floating points operations and memory references for the in- 
dividual level 1 BLAS operations is given in Tab. [TJ the 
total number of FLOPS per grid cell for all BLAS opera- 
tions is 13 and the number of memory references is 16, and 
hence this part of the algorithm is clearly memory bound. 

The strong anisotropy of the discretised PDE can be used 
to construct an efficient preconditioner: if the small hori- 
zontal couplings are ignored, the matrix A shown in Fig. 
[T] is block-diagonal with each block corresponding to one 
vertical column. Each of the tridiagonal blocks can be in- 
verted independently using the Thomas algorithm written 
down explicitly in |40_. More efficient block-SOR precon- 
ditioners can be used as well, and require the inversion of 
the same block-diagonal matrix plus an additional sparse 
matrix- vector product. 

This approach has been applied very successfully for the 
pressure solver in the dynamical core of several numerical 



weather- and climate prediction models, see |4Tjl5|l6"l,l42|. In 
particular the authors of jl] show the efficiency of a simple Id 
line relaxation in comparison to other preconditioners such 
as 2d ADI or a three dimensional pointwise SOR iteration. 
The good weak and strong scaling on up to 65536 cores of the 
HECToR Cray supercomputer and more than 10 10 degrees 
of freedom is demonstrated for the model equation ([lj in 

As discussed in section [3] the condition number of the 
preconditioned elliptic operator approaches a fixed value for 
physical choices of the parameter oj in ([TJ as the horizon- 
tal grid spacing tends to zero. Hence for these parameters 
Krylov subspace solvers for this equation are algorithmi- 
cally stable in the sense that the number of iterations does 
not diverge as the horizontal resolution increases. However, 
as shown in [35] the number of iterations and total solu- 
tion time can be reduced significantly by using (geometric) 
multigrid solvers. 

4.2 Matrix-free implementation 

Neither the matrix A nor the preconditioner matrix M need 
to be stored explicitly in the algorithm, it is sufficient to 
evaluate the sparse matrix-vector product y <-< Ax and to 
solve the equation Mx = y for x. For matrices arising 
from the discretisation of PDEs, such as the one discussed in 
section [3] the local matrix stencil only couples each grid cell 
to its neighbours. As memory access is significantly more 
expensive than floating point operations on GPUs, it will be 
beneficial to recalculate the stencil whenever it is needed in 
the sparse matrix-vector product or preconditioner solve. In 
each horizontal grid cell (i, j) the diagonal matrices Ay M ., 
in my (with T^j/ € J\f(Tij)) and the tridiagonal matrix Ax tj 
in fl8| are calculated from the precomputed vectors a, b, c 
and d. For efficiency, we instead store the vectors a', 6', c' 
and d with 



a k /d k , b' k = b k /d k , 



c k /d k , 



(17) 



as then the number of floating point operations in the sparse 
matrix-vector multiplication can be further reduced. To re- 
construct the matrix stencil the coefficients and oniy 
are also needed (we write oeij = olt^ and a^ji = ctxy ,T t , ^ 
for Ti>ji E Af(Tij) for simplicity). While these can be given 
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by relatively complicated algebraic expressions on complex 
geometries, such as the spherical grid considered in this ar- 
ticle, they only need to be calculated once in each vertical 
column and for large n z this will only lead to a small over- 
head. 

On the regular horizontal grid which we use in our imple- 
mentation the sparse matrix-vector product y <— i Ax can 
then be calculated in each grid cell (i, j, k) as 



Vijk «-i [ (K - b 'k ~ 4) ' \ T ij \ - a ij) ■ Xijk 

+ \ T ij\ ■ b 'k ■ Xi,j,k+i + \Tij\ ■ c' k ■ x itjik -i 

+ a i+lJ ' x i+l.j,k + ■ %i-l,j,k 



(18) 



(with possible modifications at the boundary of the domain) . 
For each (i,j,k) this requires 20 floating point operations 
and 12 memory references (7 loads for x, 1 store for y and 
4 loads for a' k , b' k , c' k and d k ). However, as the vectors a', 
b' , c! and d do not change from column to column these are 
likely to remain in cache, reducing the number of memory 
references to 8. Depending on the innermost loop two of the 
values of x (namely the ones belonging to the same vertical 
column which are needed at the next vertical level, i.e. Xij k 
and Xij.k+i) will most likely stay in cache, which reduces 
the number of memory references further to 6. In contrast 
14 floating point operations and 22 memory references are 
necessary if the matrix A is stored in compressed sparse row 
storage (CSR) format. Again, the actual number of memory 
references is likely to be smaller as some of the variables are 
cached. However, in this case caching is not possible for the 
matrix entries which vary from one three dimensional grid 
cell to the next. 

For the construction of the preconditioner we drop the 
second term in (JsJ, which couples different vertical columns, 

and write A^s' 1 '^ = b^'^ (using |l2| ) to implicitly con- 
vert between the global vector representation and the col- 
umn based representation in Q). For each column (i, j) the 
tridiagonal system At^x^ 1 ^ = defined by 



\Tij \ ■ d k ■ 



x k-l 



+ x^.{{a' h -b' h -^)-a ij ) (19) 



'fe+1 



Vk 



with &ij = a 



Tij | needs to be solved for x^'^ . This can be 
done efficiently in 0(n z ) time using the Thomas algorithm, 
which is essentially Gaussian elimination applied to a tridi- 
agonal system. The forward iteration requires 8 memory 
references at each step (loading the auxilliary vector (f>k-i 
and y^-i an d storing 4> k and y k '^ plus four loads for a' k , b' k , 
c! k and d k ). In each step of the backward iteration 4 memory 
references are needed. Hence the total number of memory 
references is 12. Again, some of the data might be kept in 
cache. If only the vectors a', b' , c' and d are cached the 



number of memory references reduces to 8. If in addition 
any data in the same vertical column can be kept in cache, 
the number of memory references reduces further to 5. As 
there are no dependencies in the horizontal direction, we can 
parallelise in this direction, i.e. the tridiagonal solve in each 
vertical column can be carried out independently. 

An additional advantage of the matrix free method is the 
fact that there are no matrix setup costs; the cost for pre- 
computing the vectors a', b', c', d and copying them to the 
device is neglible as these vectors only have length n z . 



4.3 Interleaved PCG algorithm 

Field vectors are accessed in different components of the 
PCG algorithm, for example the residual r is needed in the 
preconditioner solve, in the update of the residual and the 
calculation of the residual norm. In the standard implemen- 
tation of the algorithm these operations are carried out in 
separate loops over the grid, which increases the number of 
memory references as data can not be kept in cache. How- 
ever, the main iteration of the PCG algorithm can be rewrit- 
ten such that it only consists of two loops over the grid, each 
of which contains either the sparse matrix-vector multipli- 
cation or the tridiagonal solve and a number of BLAS op- 
erations. Similar loop fusion for the a GPU implementation 
of the PCG algorithm presented in [43; has been discussed 
in [2DJ. 

The main iteration of this modified algorithm is shown 
in Algorithm [T| The kernels are written down explicitly for 

Algorithm 1 Interleaved PCG loop 



for k = 1, maxiter do 

Interleaved preconditioner kernel: Calculate 

rfnr - aq, z = M~ 1 r 1 \\r\\ <-\ y/ (r, r), k <-h (r,z) 

in a single iteration over the grid. 

if (||r||/||ro|| < e or ||r|| < t) then Exit 

P <H K/Kold, Hold <-i K 

Interleaved SpMV kernel: Calculate 

uMtil ap, p<H2 + fip, q <-h Az + (3q, a (p, q) 

in a single iteration over the grid. 

a «-H K old /(T 

end for 



the matrix- free implementation in Appendix [X] The num- 
ber of floating point operations and memory references for 
the matrix-free PCG algorithm and for its interleaved ver- 
sion are shown in Tab. [2] For the memory references we 
give three different values corresponding to the following as- 
sumptions: (1) no data is kept in cache, (2) only the vectors 
a', 6', c' and d are cached and (3) in addition data in the 
same vertical column is cached. 

While the algorithm is still memory bound on the GPU, 
the number of memory references is reduced in the inter- 
leaved implementation, in particular if the cache can be used 
efficiently. 
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algo- 
rithm 


operation FLOPs 


Memory references 
no matrix columns 
cache cached cached 




bplvl V zU 
prec 13 
BLAS 13 
total 


12 8 5 
16 16 16 

/in QO 07 


Inter- 
leaved 
PCG 


SpMV 28 
prec 19 
total 47 


17 13 11 
16 12 9 
33 25 20 



Table 2: Number of floating point operations and memory 
references per iteration and per grid point for different com- 
ponents of the PCG and the interleaved PCG algorithm. 
Memory references are shown without caching, with caching 
the vectors a', b', c' and d only and assuming that all de- 
grees of freedom in a vertical column are cached as well. 



5 Graphics Processing Units 

In contrast to CPUs, on which most transistors are used 
for advanced execution control units and cache hierarchies, 
GPUs have a very large number (several hundreds to thou- 
sands) of lightweight compute cores which support SIMD 
parallelism for simple compute kernels and are ideally suited 
for floating point intensive calculations on regular data. The 
clock speed of each individual core is smaller than on the 
CPU, which improves the power efficiency. 

The cores in the GPU are grouped into a number of 
streaming multiprocessors (SMs). Data can be stored in 
global on-chip GPU memory and in addition, each SM has 
a smaller and faster shared memory. Each compute core 
can also access an even smaller local memory in addition to 
a set of registers. Constants can be stored in fast constant 
memory. Modern GPUs, such as the Fermi architecture (see 
|44|L also have a hardware managed L1/L2 cache hierachy. 
Data transfer between host and device memory has to be 
managed explicitly by the user. As typically the PCIe bus 
has a small bandwidth (at the order of ten GB/s), this is 
expensive and data should be calculated and kept on the 
GPU as long as possible. 

5.1 CUDA programming model 

The CUDA programming model described in the CUDA 
programming guide f|45|) provides an extension of the C 
language. When writing code for a GPU, the most compute 
intensive subroutines are parallelised by isolating them in 
simple kernels. On the GPU these kernels are executed by 
lightweight threads which are run on the compute cores of 
the SMs. Threads are grouped into blocks in a one-, two- 
or three dimensional grid, which execute independently on 
one SM. Synchronisation and data exchange is only possi- 
ble between threads in the same block. In each block the 
threads are arranged into a grid, i.e. each thread is uniquely 



identified by its thread index and block index. Threads are 
scheduled in groups of size 32, called warps. The threads in 
each warp execute one common instruction at a time; if the 
execution path diverges, for example due to an if-statement 
in the kernel, both branches are executed. This is known as 
thread divergence and should be avoided whenever possible. 

As our solver algorithm is memory bound, performance 
can be increased by minimising latency and increasing global 
memory bandwidth. If a warp stalls as it is waiting for data 
from memory, it is paused and the next warp which is ready 
for execution is launched by the warp scheduler. In con- 
trast to threads on a multicore CPU, the GPU scheduler 
is designed for launching and switching threads with mini- 
mal overheads. Thus, given the number of threads is large 
enough, memory latency can be hidden. To achieve this the 
GPU usually has to be oversubscribed, i.e. the number of 
threads launched at a single time should exceed the number 
of compute cores. 

For optimal efficiency, memory access for all threads in 
one half- warp should be coalesced. Global memory access is 
processed in segments of 128 bytes (which is the size of one 
cache line) and all threads in a half-warp should access the 
smallest possible number of different segments. For exam- 
ple, if each of the 16 threads reads a double precision (8 byte) 
floating point number from global memory, this will require 
the transfer of one segment if these numbers are stored con- 
secutively, but it will require 16 separate memory transfers 
and thus incur a big penalty if the numbers are more than 
128 byte apart in global memory. 



6 CUDA implementation 
PCG algorithm 



of the 



Both the standard PCG method and its interleaved version 
were implemented in C and CUDA-C. In the standard ver- 
sion the sparse matrix-vector multiplication and precondi- 
tioner were implemented as kernels and the CUBLAS library 
was used for the level 1 BLAS operations. 

To minimise memory transfers between host and device, 
data is kept on the device inside the entire PCG loop and 
all operations are carried out on the GPU. Host-device data 
transfers are only necessary for copying the initial solution 
and the right hand side to GPU memory before the CG it- 
eration and for copying the final solution back to the host 
at the end. In addition, the four vectors a', b' , c' and d 
describing the vertical discretisation need to be copied to 
the device once before the main PCG iteration. However, 
as each vector only has length n z , the amount of data tran- 
ferred is negligible in comparison to the size of the initial 
solution and right hand side vector. In particular it is sig- 
nificantly less than for matrix-explicit implementations. 
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6.1 Domain decomposition and data layout 

Due to the inherently sequential nature of the Thomas algo- 
rithm, parallelisation in the vertical direction is not possible 
for the preconditioner. Instead the code is parallelised by 
assigning each vertical column to one thread and organising 
these into threadblocks of size B — B x x B y , each of which 
is launched on one streaming multiprocessor. Parallelisation 
only in the horizontal direction is common in atmospheric 
models where data dependencies in the vertical direction are 
introduced by physical processes such as radiative transfer. 
To achieve a good occupancy the number of threads per 
block should be large, in particular latency hiding can only 
occur for B ^> 32. 

In the code all three dimensional field vectors, such as 
the solution vector u, are represented as one dimensional 
arrays of size n = m x m x n z which are stored contiguously 
in memory. To access the entry the three dimensional 
index (i,j,k) £ [0,m — 1] x [0, m — 1] x [0, n z — 1] (where 
k is the vertical index) has to be mapped to a linear index 
£ € {0, . . . ,n — 1}. To achieve optimal cache usage on the 
CPU, vertical columns are stored consecutively in memory 
in the C code, which can be achieved by the mapping 



£ (C) 



(m ■ i + j) + k 



(20) 



already introduced in (12 1. However, on a GPU, the data 
layout in ( 20 ) would lead to a significant amount of un- 
coalesced memory access as the number of vertical levels is 
large. For a typical n z of 128, consecutive threads will access 
data which is 128 x sizeof (float) bytes apart in memory. Al- 
though this problem can be mitigated by use of the LI cache 
(or manual prefetching into shared memory at the beginning 
of each kernel), on a GPU the LI cache is shared between 
a large number of threads, which severly limits the cache 
size per thread. As each thread processes an entire vertical 
column, efficient caching would only possible for relatively 
small horizontal block sizes B: in the Thomas algorithm 
two vectors of length n z need to be stored per column in 
addition to the four vectors a', b' , c' and d which describe 
the vertical discretisation. The total amount of LI cache on 
the Fermi architecture is limited to 48kB, hence for single 
precision floating point numbers one would have 



48 kB > (2B + A)n z x sizeof [float) 



(21) 



which would limit the number of threads per block to 44 for 
single precision and 22 for double precision calculations if 
we assume that n z = 128. 

An alternative approach is to change the storage format 



of the fields. Instead of (20 1 we use 



£ (CUDA-C) = m . („, . j + jfe) + i 



(22) 



in the matrix-free CUDA-C code, i.e. the first horizontal 
index runs fastest. This ensures that, provided the horizon- 
tal block size B x is larger than 16, memory access for all 
threads in a half-warp is coalesced, as illustrated in Fig. [3] 
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Figure 3: Memory layout and data access pattern. Data 
read by all threads in a warp is shown with a gray back- 
ground. If the vertically continuous ordering of the degrees 



of freedom in (20 1 is used, memory access is not coalesced 
between the threads in one warp (left). The ordering de- 
grees of freedom in (22 1 leads to coalesced memory access 



between all threads in a warp (right). 



In our numerical tests we found that the blocksize B x = 64, 
By = 2 gave good results, in particular the global load ef- 
ficiency was almost 100% for the interleaved preconditioner 
and larger than 88% for the SpMV kernel. For this blocksize 
the data processed by one block is too large to fit into cache. 
However, we find that the LI cache hit rate ranges between 
33% for the interleaved preconditioner and 56% for the inter- 
leaved SpMV kernel. The total global memory bandwidth 
is around 25% — 50% of the peak value for both kernels. 
To reduce the number of cache misses further, a more fine 
grained parallelisation would have to be used to reduce the 
data volume per thread and ensure that data used by each 
thread block can be kept in shared memory. 

The approach we are currently exploring (but which is 
not used in the implementation described in this article) is 
to use a different solver for the tridiagonal system in the 
vertical direction. The substructuring method discussed in 
|4"6"] splits the n z x n z problem into B z smaller tridiagonal 
systems of size ps (n z /B z ) x (n z /B z ), which can be solved 
independently by different threads. To obtain the global so- 
lution on the entire vector of length n z , a global tridiagonal 
system of size B z x B z for the interface points has to be 
solved before the solutions of the small subsystems can be 
combined to the total solution. 



6.2 Matrix-explicit implementation 

In addition to the matrix-free implementation described in 
section 14.21 we also wrote a version of the code based on an 
explicit representation of the matrix. Because of its popu- 
larity in the literature on sparse matrix-vector products and 
Krylov-subspace solvers (see the discussion in section [2]) we 
chose the CSR format. The n x n tridiagonal matrix used 
in the preconditioner was stored as a set of three vectors 
of length n. The cusparse{S I D}csrmv() function from the 
CUSPARSE library was used for the sparse matrix-vector 
multiplication and the subroutine cusparse-[S I D}gtsv() 
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for the tridiagonal solve in the preconditioncr. 

Setup of the matrix on the host and copying the CSR 
representation to the device would create additional costs 
as host-device data transfers are expensive. For a fair com- 
parison the matrices were set up on the device instead. Our 
numerical tests show that in this case the setup costs only 
form a small part of the total runtime. However, this might 
not be the case in other applications where matrix has to be 
constructed on the CPU. 

In addition we wrote a CPU version using our own hand- 
written CSR matrix-vector product and the LAPACK rou- 
tines GTTRF, GTTRS for the tridiagonal solver. 



7 Numerical experiments 

7.1 Hardware and compilers 

All runs were carried out on the GPU node of the aquila 
cluster in Bath. The node contains an Intel Xeon E5-2620 
Sandybridge CPU with a clockspeed of 2.00GHz and an 
nVidia Fermi M2090 GPU. The theoretical peak perfor- 
mance of one core of the Sandybridge CPU without AVX 
extensions is 8.0GFLOPs (4 floating point operations per 
cycle x 2.0 GHz). The M2090 GPU contains 512 cores run- 
ning at a clockspeed of 1.3GHz which are organised into 16 
streaming multiprocessors with 32 cores each, as described 
in the Fermi Architecture Whitepaper (01]). The total size 
of global GPU memory is 6GB, and each SM has 64kB of 
on-chip memory which can be split between shared mem- 
ory and the LI cache as 48kB/16kB or 16kB/48kB. In our 
implementation we used API calls to choose the optimal par- 
titioning. In addition the Fermi architecture has 768kB of 
global L2 cache. The theoretical peak performance is quoted 
as 1.331TFLOPs for single precision and 0.665 TFLOPs for 
double precision while the peak bandwidth for access to 
global memory is quoted as 177GB/sec. Dividing the peak 
performance by the peak bandwidth implies that around 30 
floating point operations can be carried out for each sin- 
gle/double precision variable read from global memory. In 
practise the actual number of floating point operations per 
accessed variable will of course be different due to latency 
effects and as required data might already be in LI cache. 
However, this again stresses the importance of optimising 
memory throughput to achieve good performance. 

In contrast, on the CPU the theoretical peak bandwidth 
is 41.6GB/s, and the number of floating point operations 
per value loaded from memory is 0.76 for single- and 1.54 
for double precision. 

The nVidia nvcc compiler (release 5.0, VO.2.1221) was 
used for compiling the CUDA code and we used version 4.4.6 
of the gnu C compiler for complilation of the CPU code. To 
achieve the best possible performance of the matrix-explicit 
CPU code, optimised BLAS and LAPACK libraries based on 
the OpenBLAS implementation (see [1?]) were used. AVX 
extensions were disabled on the CPU. 



matrix-explicit 

kernel 


time per call 
C CUDA 


speedup 
C vs. CUDA 


single precision 
SpMV 

preconditioner 


170.6 6.91 
205.3 12.50 


25 x 
16x 


double precision 
SpMV 

preconditioner 


182.2 10.91 
249.7 23.20 


17 x 
llx 



Table 3: Measured times and speedups for one sparse 
matrix-vector multiplication (SpMV) y <H Ax and one pre- 
conditioner solve x <-i M~ 1 y on the CPU and GPU using 
the matrix-explicit implementation. All times are given in 
milliseconds. The speedup of the CUDA-C code relative 
to the sequential CPU implementation is shown in the last 
column for each case. 



7.2 Results 

We first study the performance and speedup of the indi- 
vidual components of the PCG solver and of the entire al- 
gorithm for a fixed problem of size 256 x 256 x 128 with 
u 2 = 6.71 ■ 10~ 4 and A 2 = 3.32 • 10~ 2 , which is a typical set 
of parameters in meteorological applications. On the GPU 
we used a two-dimensional block layout and each block has 
a size of B x x B y = 64 x 2 threads. We found that this gives 
good results and varying the block size did not increase the 
performance. 

7.3 Matrix- vector multiplication and pre- 
conditioner 

In Tab. [3] the times for a single sparse matrix- vector multi- 
plication and preconditioner solve are shown for the matrix- 
explicit method. These times do not include any costs for 
setting up the matrix or for transferring data between host 
and device, as this is only required once for each PCG 
solve, which consists of a large number of sparse matrix- 
vector multiplications and preconditioner applications. The 
speedups shown in this table are relative to the sequential 
CPU implementation. Assuming that the CPU code can 
be parallelised perfectly, the socket-to-socket speedup is a 
factor 6 smaller as the CPU contains six processor cores. 

The corresponding times for the matrix-free implementa- 
tion are shown in Tab. [4] where we also show the speedup 
of the matrix-free CUDA-C code relative to the matrix- 
explicit CUDA-C code. On the CPU the matrix-free sparse 
matrix-vector multiplication is more than twice as fast as 
the matrix-explicit implementation. However, the matrix- 
explicit preconditioner is 25% — 30% faster than the corre- 
sponding matrix-free version. On the GPU the matrix-free 
code is significantly faster than the matrix-explicit version, 
both for the sparse matrix-vector multiplication where the 
speedup is 9.2x for single precision and 7.7x for double pre- 
cision. For the preconditioner the speedup is slightly smaller 
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matrix- free 

kernel 


time per call 
C CUDA 


speedup 
C vs. mat -free 
CUDA vs. CSR 


single precision 










opivl V 


78.5 


0.75 


105 x 


9.2x 


preconditioner 


252.6 


2.40 


105 x 


5.2x 


mtcrrva. Splvlv 


1 on n 


L. 10 


DUX 




interlvd. prec. 


253.1 


3.34 


76 x 




double precision 










opivi V 


80.7 


1.41 


57x 


7.7x 


preconditioner 


350.4 


3.75 


93 x 


6.2x 


interlvd. SpMV 


132.3 


3.86 


34 x 




interlvd. prec. 


351.3 


4.86 


72 x 





Table 4: Measured times and speedups for one sparse 
matrix-vector multiplication (SpMV) y <— i Ax and pre- 
conditioner solve y •H M~ 1 x on the CPU and GPU us- 
ing the matrix-free implementation. All times are given in 
milliseconds. The last two columns show the speedup of 
the matrix-free CUDA-C code relative to the corresponding 
sequential CPU implementation and relative to the matrix- 
explicit CUDA-C version. 



with 5.2 x and 6.2x for single- and double precision respec- 
tively. The speedup of both standalone matrix-free GPU 
kernels relative to the sequential CPU code is more than 
100 x in single precision. In double precision the speedup is 
93 x for the preconditioner and 57 x for the sparse matrix- 
vector multiplication. While the corresponding speedups for 
the interleaved kernels are smaller, their performance has to 
be judged in the context of the full PCG loop, which for 
the standalone kernels also contains several level 1 BLAS 
operations. 

We observe that for the matrix-free standalone SpMV ker- 
nel the double precision implementation takes about twice 
as long as the single precision version. This suggests that 
the implementation is bandwidth limited and less affected 
by cache efficiency, as the double precision version requires 
transferring twice as much data from global memory than 
the single precision implementation. This interpretation is 
also corroborated by the relatively high cache hit rate for 
this kernel reported in Tab. [7J The cache hit rate is smaller 
for the interleaved sparse matrix-vector multiplication and 
the preconditioner kernels, all of which show a smaller in- 
crease in the runtime between single and double precision. A 
similar drop of performance by nearly a factor of two when 
going from single to double precision can be observed for the 
matrix-explicit kernels, see Tab. [3] 

7.4 PCG algorithm 

We now analyse the performance of the entire PCG solver 
and break down the time spent in a single iteration of the 
algorithm. 



7.4.1 Total solution time 

We measured the time required to carry out in 100 PCG 
iterations, which is sufficient to reduce the residual by five 
orders or magnitude. Our measurements include the time 
for the matrix setup and data transfer between host and 
device. The results are listed for three different implemen- 
tations of the algorithm in Tab. [5] where we also calculated 
the speedup of the matrix-free CUDA-C code both relative 
to the C code and relative to the matrix-explicit GPU code. 
The matrix-free interleaved algorithm gives the best perfor- 
mance, with a speedup of a factor of 60 x (single precision) 
and 48 x (double precision) relative to the C code on the 
CPU. It outperforms the matrix-explicit GPU code by more 
than a factor of four. On the CPU the interleaved algorithm 
is only slightly faster than standard PCG for single preci- 
sion and even slightly slower for double precision. This is 
because in the sequential implementation most of the time 
(80% — 90%) is spent in the sparse matrix-vector multipli- 
cation and preconditioner kernels, so fusing the kernels can 
only give a speedup of no more than 10% — 20%. This is 
different on the GPU, as will be discussed in section [7. 4. 2| 
As expected the cost for setting up the vertical discreti- 
sation matrix (calculation of a', 6', c' and d) and copying 
it to the device turned out to be negligible (<C 1ms). For 
the matrix-explicit code the matrix setup time only accounts 
for a small proportion of the runtime; on the GPU the ma- 
trix setup time is 6% and 4% of the total solution time in 
single- and double precision respectively. Although for the 
matrix-free interleaved code host-device data transfer of the 
solution and right hand side vector takes up only a small 
part of the runtime (about 8% both in single- and double 
precision) , this is not true any longer if a smaller number of 
iterations is used for example to only reduce the residual by 
three orders of magnitude. We also found that on the CPU 
the total solution time can be reduced by a factor of around 
four if the Krylov- subspace solver is replaced by a geometric 
solver multigrid, as is discussed in |36j . Again, this would 
increase the relative importance of the host-device memory 
transfer. 

7.4.2 Time per iteration 

The time per iteration is given for the three different im- 
plementations of the PCG algorithm in Tab. [6] where we 
also calculate the speedup of the matrix-free implementa- 
tion relative to the matrix-explicit version. The speedup 
relative to the C implementation is 66 x for the matrix- free 
interleaved implementation in single precision and 52 x in 
double precision. In both cases it is more than four times 
faster than the matrix-explicit implementation on the GPU. 
Comparing the total time per iteration for the matrix-free 
and the interleaved implementation, we find that the ratio 
between the two is 0.69 in single precision and 0.63 in double 
precision, which should be compared to the corresponding 
ratio of memory references in Tab. [2j which is 33/40 = 0.83 
without caching or 20/27 = 0.74 assuming that the vectors 
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Table 5: Total solution time for different implementations of the PCG algorithm. Costs for matrix setup and host-device 
data transfer are listed separately and included in the total times. 100 iterations of the PCG main loop were carried out 
in all cases and all times are given in seconds. 
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Table 6: Time per iteration for different implementations of 
the conjugate gradient algorithm. All times are measured 
in milliseconds. 



a' , b' , c' and d and all data in one vertical column is cached 
(the ratio is 25/32 = 0.78 if only the a' , b' , c' and d are 
cached). To identify the remaining bottlenecks, these times 
are further broken down for the GPU code in Figs. [4] and 
[5] for single- and double precision arithmetic. Note that for 
the matrix- free code the BLAS operations, and in particular 
the axpy updates, make up a significant proportion of the 
time in the matrix-free code. The plot clearly shows the ben- 
efit of the interleaved implementation, which increases the 
performance by 28% in single precision and 35% in double 
precision. 

7.5 Absolute performance 

While comparing the runtime of the CUDA-C code to the 
corresponding sequential CPU implementation can give an 
idea of the achievable performance gains, it is somewhat 
arbitrary in that it depends on the exact CPU which is used 
for this comparison. For this reason we also quantified the 
absolute performance of the matrix- free CUDA-C code. 

Several performance indicators for the kernels of the 
matrix- free code are shown in Tab. [7] The global load 
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Figure 4: Time per iteration for different parts of the main 
CG loop on the device using single precision. The BLAS- 
operations were implemented with the CUBLAS library. 
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Figure 5: Time per iteration for different parts of the main 
CG loop on the device using double precision. The BLAS- 
operations were implemented with the CUBLAS library. 



efficiency measures the amount of coalesced global memory 
access and the LI hit rate quantifies the cache efficiency. For 
all kernels load efficiency is very high due to coalescence of 
global memory access as described in section |6.1| 

The floating point performance for one iteration of the 
matrix-free interleaved PCG algorithm is plotted for a range 
of problem sizes between 2.1 • 10 6 and 1.3 • 10 s in Fig. § The 
nVidia M2090 Fermi GPU has a global memory of 6GB, 
and as the PCG algorithm requires the storage of 5 field 
vectors, which limits the problem size to less than 3 • 10 s 
for single precision and 1.5 • 10 8 for double precision. In all 
cases the size of a vertical column was kept fixed at n z = 128. 
The performance is virtually independent of the problem size 
and about 70-80 GFLOPs for single- and 40-50 GFLOPs for 
double precision. As the algorithm is memory bound, a more 
relevant measure is the global memory band width which is 
shown for the interleaved kernels in Fig. [7j The bandwidth 
increases slightly with the problem size and 25% — 50% of 
the peak value could be achieved. 

8 Conclusions 

In this article we described the matrix-free implementation 
of a Preconditioned Conjugate Gradient solver for strongly 
anisotropic elliptic PDEs on a GPU. Equations of this type 
arise in many applications in atmospheric and geophysical 
modelling if the problem is discretised in "flat" geometries. 
In particular the semi-implicit semi-Lagrangian time dis- 
cretisation of the non-hydrostatic Euler equations in the dy- 
namical core of many weather- and climate prediction mod- 
els leads to a three dimensional PDE for the pressure cor- 
rection and due to the small thickness of the atmosphere the 
elliptic operator has a very strong anisotropy in the vertical 
direction. This anisotropy can be exploited to construct a 
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Table 7: Performance measurements of the different matrix- 
free kernels as reported by the nVidia visual profiler. The 
bandwidth is the DRAM load bandwidth. 
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Figure 6: Floating point performance for different problem 
sizes for the matrix-free interleaved PCG code on the GPU. 
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Figure 7: Global memory bandwidth (DRAM read band- 
width as reported by the nVidia Visual Profiler) for different 
problem sizes for the interleaved apply- and preconditioner 
kernels. 



simple and efficient preconditioner based on vertical line re- 
laxation. The dependencies in each vertical column require 
a horizontal domain decomposition, which has implementa- 
tions for the data- and thread layout on the GPU. 

As the performance of the algorithm is limited by the 
speed with which data can transferred from global memory 
to the compute units, it is important to reduce the number of 
memory references. We achieved this by using a matrix-free 
implementation which recalculates the local matrix stencil 
whenever it is needed instead of reading it from memory. In 
addition, we reduced the amount of data transfer by fusing 
several kernels in the PCG loop, which gave an additional 
improvement of 28% in single precision and 35% in double 
precision. In total we demonstrated that on an nVidia Fermi 
M2090 GPU the best matrix-free code achieved a speedup of 
60 x in single precision and 48 x for double precision, com- 
pared to the corresponding sequential implementation on an 
Intel Xeon E5-2620 Sandybridge CPU. In terms of absolute 
times, the residual for a problem with 8.3 • 10 6 degrees of 
freedom could be reduced by more than five orders of magni- 
tude in 0.62 seconds in single precision. The double precision 
implementation takes about 1.5 times longer, which demon- 
strates the good double precision performance of modern 
GPUs. Our matrix-free version is more than four times 
faster than a matrix-explicit GPU implementation based on 
the CUSPARSE and CUBLAS libraries using the CSR ma- 
trix format which clearly demonstrates the benefit of our ap- 
proach. We measured the absolute performance of our code 
for a range of problem sizes and achieved a global memory 
bandwidth of 25% — 50% of the peak rate for problem sizes 
between 2.1 • 10 6 and 1.3 • 10 s degrees of freedom. 

Global memory access was coalesced for the threads 
within a half warp by numbering the degrees of freedom 
such that the horizontal index runs fastest. This is differ- 
ent from CPU implementations where vertical columns are 



stored consecutively in memory for cache efficiency. While 
the specific implementation discussed in this article is based 
on a regular horizontal grid, our method can be applied to 
any three dimensional grid which can be written as the ten- 
sor product of possibly unstructured horizontal grid and a 
non-uniform one dimensional grid in the vertical direction. 

The achieved global memory bandwidth is a sizeable frac- 
tion of the peak value, and it can be theoretically improved 
by an additional factor 2x to 4x by making better use of 
the GPU cache or shared memory. This would require the 
parallelisation of the tridiagonal solver in the vertical direc- 
tion, and we are currently investigating the substructuring 
approach described in [46 . 

With the planned increase in weather- and climate model 
resolution, problems with more than 10 10 degrees of freedom 
need to be solved. Clearly this is not possible on a single 
GPU due to limited global memory size. To solve problems 
of this size hundreds of GPUs are necessary as each has 
limited memory. We are currently extending the algorithm 
to multi-GPU clusters will introduce additional bottlenecks: 
unless data can be copied directly between GPU memory, it 
has to be transferred to the host at each iteration before it 
can be sent through the standard MPI network. However, 
in this case only halo data needs to be exchanged between 
neighbouring devices and given that the local problem size 
is not too small, this is significantly less than the total data 
processed by one GPU. 

The PCG solver described in this work requires around 
hundred iterations to reduce the residual by five orders of 
magnitude. In contrast, multigrid solvers can achieve the 
same reduction in a much smaller number of iterations, as 
has been demonstrated in [35] for the elliptic PDE consid- 
ered in this article. The preconditioner used in this work 
can be used as a multigrid smoother and the only missing 
components are intergrid operators for restriction and pro- 
longation, which is the object of our current research. 
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A Interleaved PCG kernels 

The kernels for the interleaved PCG algorithm described in 
section [4~3] are shown in Algorithms [2] and [3j In the GPU im- 
plementation each thread calculates dot products and norms 
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in one column. To obtain global sums, these need to be 
reduced with an additional BLAS operation. However, as 
this operation only operates on two-dimensional (horizon- 
tal) vectors, its cost is very small (< 1% of the time per 
iteration) . 



Algorithm 2 Interleaved matrix-multiplication kernel. Si- 
multaneously calculate u <H u + ap, p <— i z + f3p, q 4-t 
Az + (3q, a <— i (p, q) in a single iteration over the grid. 

l: for i = 0,...,mdo 

2: for j = 0, . . . , m do 

3: Calculate ai'j> and |Ty| 

4: (7 4 — I 

5: for k = 0, . . . , n z — 1 do 

6: p* <-h Pijfe, <f <"H Qijfe, Z* <-H Zjjfc 

7: Ujjfc <-H Ujjfc + a ■ p* 

8: p* <-H /3 • + Z* , (?* <-H /3 • g* 

9: Pijfc -H P* 

10: 5g «h (« -b' k - c' k ) ■ \Tij\ - a t] ) ■ z* 

+ K ' ' + c 'k ' l^ijl ' z i,j,fe-l 

+ a i+l,j ' + a i-l,j ' z i-l,j,k 

+ ' ^i,j+l,jfe + CKi,i — 1 ' z ij-l,fc 

11: g* <-h + d k ■ Sq, a 4-i cr + p* • q* 

12: g i:#fc ^ i q* 

13: end for 
14: end for 
15: end for 
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